



## INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT)

|                                                                                                                                                                   |    |                                                                                                                                                                                                                                                                                                                               |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|----|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| (51) International Patent Classification 5 :<br>G11C 8/00                                                                                                         | A1 | (11) International Publication Number: WO 92/07361<br>(43) International Publication Date: 30 April 1992 (30.04.92)                                                                                                                                                                                                           |
| (21) International Application Number: PCT/US91/07256                                                                                                             |    | (81) Designated States: AT (European patent), BE (European patent), CH (European patent), DE (European patent), DK (European patent), ES (European patent), FR (European patent), GB (European patent), GR (European patent), IT (European patent), JP, KR, LU (European patent), NL (European patent), SE (European patent). |
| (22) International Filing Date: 10 October 1991 (10.10.91)                                                                                                        |    |                                                                                                                                                                                                                                                                                                                               |
| (30) Priority data:<br>595,350 10 October 1990 (10.10.90) US                                                                                                      |    | Published<br>With international search report.<br>With amended claims.                                                                                                                                                                                                                                                        |
| (71) Applicant: HAL COMPUTER SYSTEMS, INC. [US/US];<br>1315 Dell Avenue, Campbell, CA 95008 (US).                                                                 |    |                                                                                                                                                                                                                                                                                                                               |
| (72) Inventor: WILLIAMS, Ted, E. ; 1971 W. Middlefield<br>Road, #4, Mountain View, CA 94043 (US).                                                                 |    |                                                                                                                                                                                                                                                                                                                               |
| (74) Agents: SHERIDAN, James, A. et al.; Flehr, Hohbach,<br>Test, Albritton & Herbert, Four Embarcadero Center,<br>Suite 3400, San Francisco, CA 94111-4187 (US). |    |                                                                                                                                                                                                                                                                                                                               |

## (54) Title: ZERO OVERHEAD SELF-TIMED ITERATIVE LOGIC



BEST AVAILABLE COPY

## (57) Abstract

CMOS domino logic is normally used only in two phases: precharge and logic evaluation. The invention uses a third phase to store data, which allows domino logic gates (1, 2, 3) to be cascaded and pipelined without intervention latches. The inputs (Data In) to this system must have strictly monotonic transitions during the logic evaluation phase and the precharge signal must be active during only the precharge phase. Furthermore, the pipelined system can feed its output back to the input to form an iterative structure. Such a feedback pipeline is viewed as a "loop" or "ring" of logic which circulates data until the entire computation is complete.

**FOR THE PURPOSES OF INFORMATION ONLY**

Codes used to identify States party to the PCT on the front pages of pamphlets publishing international applications under the PCT.

|     |                          |    |                                          |     |                          |
|-----|--------------------------|----|------------------------------------------|-----|--------------------------|
| AT  | Austria                  | ES | Spain                                    | MG  | Madagascar               |
| AU  | Australia                | FI | Finland                                  | ML  | Mali                     |
| BB  | Barbados                 | FR | France                                   | MN  | Mongolia                 |
| BE  | Belgium                  | GA | Gabon                                    | MR  | Mauritania               |
| BF  | Burkina Faso             | GB | United Kingdom                           | MW  | Malawi                   |
| BG  | Bulgaria                 | GN | Guinea                                   | NL  | Netherlands              |
| BJ  | Benin                    | GR | Greece                                   | NO  | Norway                   |
| BR  | Brazil                   | HU | Hungary                                  | PL  | Poland                   |
| CA  | Canada                   | IT | Italy                                    | RO  | Romania                  |
| CF  | Central African Republic | JP | Japan                                    | SD  | Sudan                    |
| CG  | Congo                    | KP | Democratic People's Republic<br>of Korea | SE  | Sweden                   |
| CH  | Switzerland              | KR | Republic of Korea                        | SN  | Senegal                  |
| CI  | Côte d'Ivoire            | LI | Liechtenstein                            | SU* | Soviet Union             |
| CM  | Cameroon                 | LK | Sri Lanka                                | TD  | Chad                     |
| CS  | Czechoslovakia           | LU | Luxembourg                               | TG  | Togo                     |
| DE* | Germany                  | MC | Mosambique                               | US  | United States of America |
| DK  | Denmark                  |    |                                          |     |                          |

+ Any designation of "SU" has effect in the Russian Federation. It is not yet known whether any such designation has effect in other States of the former Soviet Union.

### ZERO OVERHEAD SELF-TIMED ITERATIVE LOGIC

#### FIELD OF THE INVENTION

This invention relates to digital electronic circuits, and, more particularly, to self-timed circuits, including iterative division algorithms. The design technique of the present patent is called "Zero-Overhead Self-Timed Iterative Logic," abbreviated ZOSTIL.

#### BACKGROUND OF THE INVENTION

The timing performance of any system can be judged by one of two measures: latency or throughput. The delay from an input to the resulting output is called the latency, and most real world problems desire this delay to be minimized. If a system can have several computations in progress at once, then the minimum delay between two successive inputs determines the throughput, which is the maximum data rate at which the system can accept requests for computation. Performance assessed by either of these measures depends on the sum of the raw propagation delay through the combinational logic of the desired function plus "other" overhead delays. From a theoretical point of view, the fastest circuit would eliminate all overheads and have circuit delays due to only the raw combinational logic. The innovations in this patent reduce the latency overhead in a pipeline to zero. Hence, the ZOSTIL innovation will produce functions whose latency attains the theoretical lower bound, but without requiring the large and costly area of a full combinational array.

-2-

Traditional synchronous circuit design techniques separate combinational logic from data storage. That is, storage is provided by explicit latches interposed between sections of combinational logic. This design technique 5 has at least four sources of overhead which increase circuit latency: 1) propagation delay through latches; 2) margin added to tolerate clock skew; 3) wasted time in fast stages within the system; 4) maximizing data-dependent delay; and 5) the assumption of worst case 10 timing of components.

The first source of latency overhead is due to latches because they introduce additional delays due to their set-up time and propagation delays. The minimum cycle time of a synchronous circuit is the sum of the 15 latch set-up time, latch propagation delay, and maximum combinational logic delay. The first innovation in the ZOSTIL methodology is remove this overhead completely by removing the explicit latches altogether and making use of the "free" half-latch at the output of each stage in a 20 CMOS domino chain.

The second source of latency overhead comes from needing to distribute the clock to all latches in the system. Communicating stages must be in agreement as to when the clock edges occur, but wire or driver delays 25 cause clock skew which must be compensated for by adding some margin to the total clock period. This added margin is also overhead. Previous asynchronous design techniques used handshaking blocks to remove global clocks and the extra latency overhead due to clock skew by communicating 30 data validity locally instead of globally. But these previous techniques include explicit latches, and hence, still had the latency overhead due to latch propagation delays. Previous techniques also added some overhead due to the forward directed paths within the handshaking 35 logic. The second ZOSTIL innovation is to insure all

-3-

control paths operate in parallel with the forward evaluation rather than adding sequentially to the path.

The third source of latency overhead is due to mismatching of the functional sections between the 5 latches. Because the amount of time in a clock period is fixed, it must be set equal to the longest propagation delay of all of the different functional sections in the system. The difference between that maximum and the actual time used by any functional section is overhead 10 because it is wasted time. A self-timed dataflow does not waste this time because it allows data to flow forward based on data-driven local control, rather than waiting for clock edges. Although the throughput of a pipeline is still limited by its slowest stage, the latency is 15 improved by letting each stage progress as soon as it can.

The fourth source of latency overhead comes from determining critical paths in synchronous logic based on the worst-case data values. If there is a large variance then there is a large performance loss due to the difference between the average and maximum values of delay. 20 Synchronous designers try to adjust transistor sizing to equalize the various paths through a body of logic, but in self-timed systems it is desired to minimize the probabilistic expected value of the delay rather than minimizing 25 the maximum delay. The third innovation of this patent is to make use of any known probabilistic distribution of the inputs of each block of logic in order to size the transistors in that block to minimize the expected value of the total delay.

30 The fifth source of latency overhead is the derating used to insure performance over a range of temperature and voltage levels. Synchronous system design must always be based on conservative derated "worst-case" specifications because the system must work at the environmental

**SUBSTITUTE SHEET**

-4-

extremes. But when the actual conditions are not at the extremes, the difference between the possible performance and the actual designed performance is wasted performance. Self-timed components will always run at their maximum speed for the existing conditions and deliver their outputs as soon as they are actually finished. By providing completion indication, they allow an enclosing system to make use of the output sooner than always waiting for the worst case.

10 Background and Nomenclature for Dual-Monotonic Signals

If A is a dual-monotonic signal, it is represented by two "sub-signals", called  $A^0$  and  $A^1$ , with the encoding: if both of the wires are in the same logical state, say low, then the signal A has not yet evaluated; if either  $A^0$  or  $A^1$  changes state, this communicates the signal A has finished evaluating, and the state of A is determined by noting which of the two wires changed. For example, if both  $A^0$  and  $A^1$  have the binary value '0', then the value of the signal, A, is not yet determined. If  $A^1$  transitions to '1', then the value of A is '1', while if  $A^0$  transitions to '1', then the value of A is '0'. The pair of wires is called a dual-monotonic pair because the transitions on the wires must be monotonic during evaluation. These transitions are mutually exclusive, and either one indicates the evaluation is complete and can be used by other circuits. In this patent, signal names are italicized, and a "\*" is used to indicate logical inversion. Also, each half of a dual-monotonic signal will have a superscript of 1 or 0.

30 Background on Domino Logic

Monotonic signals can be conveniently generated by CMOS domino logic. Each signal can be in one of three functional phases: 1) precharge or reset, 2) logic evaluation, or 3) data storage. These three phases are 35 shown in Figure 1, which shows a two-input dual-monotonic

-5-

AND gate and its waveform diagram. During the reset phase, the active low precharge signal,  $P^*$ , is active and the A and S signals must be inactive. This causes the precharged nodes  $X^*$  and  $Y^*$  to be high, and the Q outputs, 5 to be low. In the logic evaluation phase, either  $A^0$  or  $A^1$  and either  $B^0$  or  $B^1$  will transition high monotonically. If both  $A^0$  and  $B^0$  transition high, the AND gate's  $Q^1$  output 10 monotonically transitions high, and if either  $A^0$  and  $B^0$  go active, the  $Q^0$  output will go high. During the data storage phase, both A and S signals are forced low, and  $P^*$  15 remains inactive. This condition leaves the precharged nodes  $X^*$  and  $Y^*$  undriven, and capacitance causes them to act as a memory element so the outputs,  $Q^1$  and  $Q^0$ , remain in the same state as they were during the logic evaluation phase. Thus, each domino stage includes a "free" half-latch because no additional transistors and no additional logic delays are needed to store data.

#### Overview of the Innovations

CMOS domino logic is normally used only in two 20 phases: precharge and logic evaluation. The invention of the present patent uses a third phase to store data, which allows domino logic gates to be cascaded and pipelined without intervening latches. The inputs to this system must have strictly monotonic transitions during the logic 25 evaluation phase and the precharge signal must be active during only the precharge phase. Furthermore, the pipelined system can feed its output back to the input to form an iterative structure. Such a feedback pipeline is viewed as a "loop" or "ring" of logic which circulates 30 data until the entire computation is complete.

The innovation of making use of the temporary storage of a precharged function block allows the explicit latches to be omitted. Each domino stage provides the operation of a half-latch for free. The Reset Control logic operates 35 completely in parallel with the function block

-6-

evaluation. Completion detection logic in each Reset Control block observes the output of the following Function Block to determine when all of its outputs have finished evaluating and then instructs its own Function Block to move from the data storage phase to the precharge phase, driving all its outputs to the reset state. When the outputs of the following Function Block subsequently become reset, the Reset Control turns off the precharge signal for its Function Block, causing it to be ready for the data evaluation phase when its next data input actually arrives. By encoding the data in dual-monotonic pairs, there is no forward handshake required and thus the control logic is removed from the critical path of the circuit. This innovative methodology, in conjunction with the first innovation removing the need for explicit latches, yields a truly zero overhead minimum latency delay path through pipelined logic.

The ZOSTIL technique includes combining the latch-free circuits and parallel Reset Control into an iterative structure, or "Ring." This is particularly important for arithmetic operations which perform the same basic function over and over. Examples of these type of functions are: multiplication, division, square root, sine, and cosine.

ZOSTIL circuits are robust because, with proper design of the control logic, they are delay-independent. That is, the circuits will function correctly regardless of the actual delays of the circuit elements. Therefore, calculations involving delays are not necessary to insure the logical correctness or functionality of the system, but are used only to estimate the performance. This contrasts to synchronous design techniques which require extensive delay calculations to insure all computations within a single logic stage can be performed in one clock cycle. Improper delay estimation may result in a synch-

-7-

ronous circuit which does not always produce the correct result.

Division algorithms generate a quotient by successive determination of quotient digits from most significant to 5 least significant. Because each quotient digit is used in the computation of the next partial remainder, which in turn is required to determine the next quotient digit, division is an inherently sequential process. Hence, a pipelined ring designed with the ZOSTIL technique is ideal 10 for performing arithmetic division. An additional innovation specific to division is to overlap and interlock stages to allow two remainder computations to occur in parallel. This is accomplished by modifying an algorithm, known as SRT division, to perform several small remainder 15 computations in parallel and choose the correct remainder when the quotient digit from the previous stage is determined. This innovation improves the overall latency by a factor of two in comparison with the previous algorithms.

BRIEF DESCRIPTION OF THE DRAWINGS

20 The present invention will be better understood by reviewing this description given with reference to the following drawings:

Figure 1: Self-Timed Domino Logic AND Gate. This is a schematic of a two-input dual-monotonic self-timed AND 25 gate constructed in CMOS technology;

Figure 2: Precharged Function Blocks. This is a linear pipeline of precharged function blocks which also includes logic for completion detection, which is used to reset the function blocks;

30 Figure 3: Datapaths Merging. This is a schematic showing the Control Reset Logic needed to merge two self-timed pipelines;

-8-

Figure 4: Datapaths Splitting. This is a schematic showing the Control Reset logic needed to split a self-timed pipeline;

5 Figure 5: Improving Expected Total Delay. This shows how changing the circuit topology without changing the function can improve the expected circuit performance;

Figure 6: Logic for Four Stage Self-Timed Ring. This is a four stage pipeline ring of precharged function blocks;

10 Figure 7: Dependency Graph for Four Stage Self-Timed Ring. This is dependency graph of the schematic shown in Figure 6;

Figure 8: Sequential Data-flow in Ordinary Radix 2 SRT Division. This is schematic showing the dataflow in 15 one stage of previously described SRT division algorithm; and

Figure 9: Intra-stage Overlapped Execution of Radix 2 SRT Division. This is a schematic showing the improvement in SRT division.

20 DETAILED DESCRIPTION OF THE INNOVATIONS

This patent develops innovations in asynchronous circuit design technique leading to "Zero-Overhead Self-Timed Iterative Logic," abbreviated ZOSTIL.

The ZOSTIL Technique

25 Asynchronous circuits have the potential for avoiding the latency overheads of synchronous circuits. By communicating completion status along with the data, each processing element can begin to operate on data as soon as it arrives without waiting for re-synchronization to a 30 global clock at every latch.

Previous implementations of asynchronous logic used explicit latches to store intermediate results. These latches introduce additional propagation delay to the circuit's critical path, but do not directly contribute to

-9-

the computational function. The first innovation of this patent is to avoid explicit latches entirely by using CMOS domino function blocks as "free" half-latches. This is possible only if the control for the function block pre-  
5 charge makes certain the outputs from a function block have been utilized by all subsequent stages before resetting a function block and destroying the data. In order to determine when succeeding function blocks are finished using the data, it is necessary to construct a completion  
10 detector.

A simple OR gate connected to the two wires of each dual-monotonic pair output provides a done indicator for each individual signal output from a logic stage. The stage is considered done computing when all of its data  
15 outputs are individually done. A tree of last-of gates, commonly called C-elements, can be used as the completion detector to determine when all of the bits in a datapath have changed. Each C-element has the property that its output is that of the inputs when they were last the same.  
20 The output of a tree of C-elements will indicate done when all of the inputs are done and the output will indicate reset when all of the inputs have reset.

Once the completion signals are generated, they must be used to provide the control for resetting the pre-  
25 charged blocks appropriately. Previous self-timed circuits required both forward and backward handshakes, but the second innovation of this patent is to completely embed the completion indication of the forward data in dual-monotonic pairs and to eliminate the forward handshake. Further, the backward handshake can be designed so  
30 that it does not affect the forward critical path. For a simple unidirectional data flow, this control is a sequence of backward pointing inverters as shown in Figure 2. For merging and splitting datapaths, the control  
35 is as shown in Figures 3 and 4. None of these con-

-10-

trol circuits are in the direct path of the forward flowing data and hence they do not add to the latency of the forward flowing wave of data.

5        The control logic resets each function block before the next wave of data comes along and causes the function block to evaluate its outputs again. As long as the data waves are spaced apart, each data wave will propagate with latency equal only to the pure combinational delay, without any additional overhead.

10       If a problem requires repetitive execution of a logical function, then it is particularly appropriate to build a ring of precharged function blocks. The function blocks can be the same, or they may implement different functions. The width of the datapath between stages of 15      the ring need not be constant. The data in a self-timed ring loops around the ring at the same speed as it could progress through a large combinational array, but the silicon area of the circuit is much reduced. A physical analogy to this is a circle of dominoes. The trick is to 20      make the wave of falling dominoes progress around the circle continuously at the same speed as it would down a long row of dominoes, and this is accomplished by standing each domino back up after its successor has fallen.

25       Usually a linear pipeline is judged by the throughput of the stages. But when the stages are connected into a ring to solve a single iterative problem, the time it takes to compute the answer is dependent on the latency through the stages. So having low latency is the important property for the stages in a loop rather than 30      throughput. Zero overhead control logic reduces the latency to the lower bound of strictly the combinational delay of the function blocks.

Dependency Graphs Verifying Zero Overhead

-11-

The ZOSTIL technique results in circuits which function correctly independent of the actual delays taken by each of the blocks. The designs are thus robust since changes in the delays will not affect the logical operation of the circuit. But, the actual delays determine the overall performance, and relative delays determine which path through the circuit is the limiting, or critical, path. The objective of "Zero-overhead" design is to make sure that the performance is limited only by the function blocks comprising the desired combinational logic, and hence that the critical path under nominal relative delays does not go through any control blocks. If the performance analysis shows the circuit does not achieve zero overhead, then the schematic can be modified by making the stages more finely grained until zero overhead latency is achieved.

To insure a design has no overhead due to control, dependency graphs are drawn to illustrate all possible critical paths. A simple ring and its dependency graph are shown in Figures 6 and 7. The critical cycle time of data flowing around the ring will be the longest cyclic path in the graph. The graph is really a restricted Petri-net and the firing rules are the same; a node is marked when all of its predecessors have been marked.

25                    Adjusting Logic and Transistor Sizing  
                      Based on Data Probabilities

Since self-timed circuits accompany data with completion signals, subsequent computations may begin as soon as each data arrives. Since it is not required that processing times are the same, what is really desired is that the total expected value of delay is minimized. In cases where data values are distributed with equal probability, the expected value is, of course, minimized when the average delay is minimized. However, in some cases, the designer may know that data values will have a particular

-12-

distribution, and this information can be used to minimize the total expected value to be better than just the average of all data value delays.

Paths that are known to have higher than average usage can be made faster by shortening the number of logic blocks they contain or by widening the transistors so that the blocks go faster. For example in Figure 5, a net improvement in the expected value of delay will result if, by inverting both arms of a multiplexor, an inverter is removed from the arm known to be more frequently chosen, even though this results in an inverter being added to the other arm. Likewise, if some output of a block must be loaded with transistors beginning two different paths, narrowing the transistors in the infrequently chosen path will slow that path, but will result in an overall improvement in expected delay because the output node which was also part of the frequently chosen path will be faster due to less loading.

Self-timed SRT Division

Performing division requires making a choice of quotient digits starting with the most significant, and progressing to the least significant, digits. The quotient digit decision is made as a part of each iteration which recomputes the next partial remainder based on the previous partial remainder and quotient digit. Between each iteration, the partial remainder is shifted left by the base, or radix  $r$ , of the digits being used. Each iteration thus implements

$$R_{i+1} = rR_i - Dq_i$$

where  $R_i$  is the partial remainder output from stage  $i$ ,  $r$  is the radix,  $q_i$  is the quotient digit determined from stage,  $D$  is the Divisor, and the sequence is initialized with  $rR_0 =$  the Dividend.

-13-

In ordinary division, the quotient digits  $q_i$  are in the set  $\{0, \dots, r-1\}$ , and the full quotient has only a single valid representation since each digit position in the quotient has only a single correct representation.

5 Unfortunately, determining the correct digit at each position requires comparison of the exact partial remainder, and this means the entire partial remainder must be computed before determining each quotient digit. This computation requires a complete carry-propagate subtract

10 to generate the partial remainder before each quotient digit may be selected.

One published algorithm for division is known as the SRT algorithm. The key idea of SRT division is to avoid a complete carry propagation in each iteration by making the

15 set of valid quotient digits redundant by including both positive and negative integers in the set  $\{p, \dots, 0, \dots, -p\}$ . The range of quotient digits must have

$$\frac{x}{2} \leq p \leq x-1 .$$

With redundant quotient digit sets, the final quotient result can be represented in several different ways,

20 giving a choice of quotient digits for each position. Any valid representation can always, of course, be converted to the desired irredundant representation by subtracting the positionally weighted negative quotient digits from the positionally weighted positive digits. This subtraction requires a carry propagation, but it is a single operation which needs only to be performed once for the whole division operation rather than once per stage. Further, in an integrated floating-point chip, this full-length carry-propagate operation could be performed by

25 30 shipping the quotient results to a separate part of the chip implementing fast carry-look-ahead addition.

-14-

Since, in SRT division, the quotient set contains digits of both signs, the quotient selection logic for a given position need only use an approximation of the divisor and partial remainder. This is because small 5 errors may be corrected at a later stage with less significant quotient digits of the opposite sign. Because only an approximation of the partial remainder is required at each stage for the selection of quotient digits, only a small number of the most significant bits of the partial 10 remainder need to be examined.

The simplest form of SRT division is to use radix  $r=2$  with only three quotient digits: +1, 0, -1. This requires looking at only the top four bits of remainder at each stage in order to make the correct quotient digit selection. The ordinary sequential data-flow for each stage of 15 this algorithm is shown in Figure 8. In discrete implementations, higher radices such as  $r=4$  and  $r=16$  have routinely been used.

The probabilistic distribution of quotient digits is 20 not uniform due to the numerical properties of SRT division. In the radix 2 case, the three quotient digits have probabilities of 42%, 35%, and 23%; and 4% of the time it is even possible to predict two quotient digits in advance. The sign bit of the internal partial remainders 25 has a 77% probability of being on, even for uniformly distributed input operands. These statistics are used to speed the more frequently used circuit paths since the self-timed implementation which can take advantage of the improvement.

30 Intra-stage Overlapped Execution Innovation

Prior to this innovation, the steps of the SRT division algorithm have been regarded as being purely sequential. In this patent, the steps within each stage of the algorithm are overlapped which makes it faster by

-15-

allowing additional parallelism. The data flow for this innovation is shown in Figure 9. Specifically, the partial 4-bit carry-save and carry-propagate adders for the remainder formation in each stage can operate in parallel with the previous quotient digit selection and the stages own divisor multiple multiplexor and 54-bit carry-save adder. One of the inputs to the partial adders used to be the chosen divisor multiple from the previous stage, which required knowing the selected quotient digit.

5 But if the partial adders operate in parallel, then the quotient digit is not yet determined. Instead, the innovation is to duplicate the partial adders for each of the possible quotient digits, allowing them to begin computation earlier and in parallel, and then choose 10 between their results when the quotient digit from the previous stage catches up. Since there are three possible quotient digits, there needs to be a path for each possibility. Fortunately, since one of the quotient digits is zero, there need be only two partial carry-save adders.

15 20 This innovation trims the average propagation delay per stage by approximately one-half because the delay is dominated by the carry-propagate adder, and the intra-stage overlapped execution allows the carry-propagate additions in two successive stages to be executing simultaneously.

The innovation of intra-stage overlapped execution of SRT division can be combined with the ZOSTIL innovation by self-timing a sequence of stages, each having the data flow of Figure 9. This requires using the merge and join 30 constructs presented in Figures 3 and 4. A loop of four of these stages will repetitively operate as fast as if the logic for the stages were assembled into a prohibitively large combinational array.

-16-

WHAT IS CLAIMED IS:

1. A zero-overhead self-timed iterative logic circuit utilizing CMOS domino function blocks incorporating a completion detector associated with each said function block for determining that succeeding function blocks are finished using the data from said function block, a function block precharge means being responsive to said completion detector to precharge said function block only when the outputs have been utilized by succeeding blocks, whereupon the function block is reset by said precharge means and the data therein is destroyed.
2. Pipelines or rings that can keep data tokens separated as they flow through series of precharged stages without explicit latches.
3. Logic that iterates with zero-overhead over the raw combinational latencies of its function blocks.
4. The folded dependency graph analysis method used to analyze critical paths in iterating asynchronous circuits.
5. A methodology for designing pipelines or rings of function blocks that continuously propagate data forward at the speed of a combinational array.
6. Asynchronous rings that "wrap" a combinational array onto a small set of stages, achieving the performance of the array without its area.
7. A self-timed integrated circuit for division which iterates at the speed of a combinational array, while requiring the instantiation of only a few stages.

## AMENDED CLAIMS

[received by the International Bureau on 24 March 1992 (24.03.92) :  
original claims 1-7 replaced by amended claims 1-41 (12 pages)]

1. Apparatus for logically evaluating signals and providing evaluated signals, comprising:
  - a first output terminal, said first output terminal operating in a reset phase, a evaluation phase and a storage phase;
  - 5 first resetting means, coupled to said first output terminal, for resetting said first output terminal in a resetting phase; and
  - a first evaluating means, coupled to said first output terminal, for evaluating at least one signal in said evaluation phase;
  - 10 wherein said first resetting means and said first evaluating means are in inactive states during said storage phase, and said first output terminal provides said evaluated signals during said storage phase.
2. The apparatus of claim 1, further comprising:
  - means, coupled to said first output terminal, for sustaining said evaluated signals.
3. The apparatus of claim 2, wherein said evaluating means receives evaluates first states of at least two dual-monotonic signals.
4. The apparatus of claim 3, further comprising:
  - a second output terminal, said second output terminal operating in said reset, evaluation and said storage phases;
  - 5 second resetting means, coupled to said second output terminal, for resetting said second output terminal in said resetting phase; and
  - a second evaluating means, coupled to said second output terminal, for evaluating said second states of said at least two dual-monotonic signals in said evaluation phase;
- 10

wherein said second resetting means and said second evaluating means are in inactive states during said storage phase, and said second output terminal sustains and provides said evaluated monotonic signals during said 5 storage.

5. The apparatus of claim 4, further comprising: means, coupled to said second output terminal, for sustaining said evaluated signals.

6. The apparatus of claim 5, wherein said first evaluating means includes two transistor means, two control terminals of said two transistor means receive said first states of two 5 dual-monotonic signals, and wherein said two transistor means are connected in series to said first control terminal; and

wherein said second evaluating means includes two transistor means, two control terminals of said two transistor means receiving said second states of two 10 dual-monotonic signals, and wherein said two transistor means are connected in parallel to said second control terminal.

7. A method in use with an apparatus, said apparatus including a first output terminal, first resetting means coupled to said first output terminal and a first evaluating means coupled to said first output terminal, 5 said method comprising the steps of:

resetting said first output terminal in a resetting phase;

evaluating said first terminal in a evaluating phase;

10 setting said first resetting means and first setting in inactive states; and

sustaining evaluated signals at said first output terminal.

8. The method of claim 7, wherein said apparatus evaluates at least two dual-monotonic signals and further includes a second output terminal, second resetting means, and a second evaluating means, wherein said method 5 further comprises the steps of:

resetting said second output terminal in a resetting phase;

evaluating said second terminal in an evaluating phase;

10 setting said second resetting means and second setting in inactive states; and

sustaining evaluated signals at said second output terminal.

9. The method of claim 10, further comprising the steps of:

evaluating first states of two dual-monotonic signals at said first output terminal in an AND logic 5 relation; and

evaluating second states of two dual-monotonic signals at said second output terminal in an OR logic relation.

10. An apparatus, comprising:

a plurality of function blocks with each of said function blocks able to generate a logic output, wherein a logic output generated by a trailing function block

5 provides a logic input of a preceding function block;

a plurality of detectors, each of said detectors being associated with one of said function blocks, for generating a completion signal upon detecting that a preceding function block has generated a logic output

10 in response to a logic output of a trailing function block; and

means for resetting said trailing function block in response said completion signal.

11. The apparatus of claim 10, further comprising:  
means for generating a final logic output in response to logic outputs from said function blocks.
12. The apparatus of claim 11, further comprising:  
means for generating a final completion signal upon detection of said final logic output.
13. The apparatus of claim 11, wherein each of said function blocks is domino circuit.
14. The apparatus of claim 13, wherein said domino circuit receives dual-monotonic signals, said domino circuit comprising:
  - 5 a first output terminal, said first output terminal operating in a reset phase, a evaluation phase and a storage phase;
  - first resetting means, coupled to said first output terminal, for resetting said first output terminal in a resetting phase;
  - 10 a first evaluating means, coupled to said first output terminal, for evaluating at least one signal in said evaluation phase;
  - 15 a second output terminal, said second output terminal operating in said reset, evaluation and said storage phases;
  - second resetting means, coupled to said second output terminal, for resetting said second output terminal in said resetting phase; and
  - 20 a second evaluating means, coupled to said second output terminal, for evaluating said second states of said at least two dual-monotonic signals in said evaluation phase;
  - 25 wherein said first resetting means and said first evaluating means are in inactive states during said storage phase, and said first output terminal sustains and provides said evaluated signals during said storage;

wherein said second resetting means and said second evaluating means are in inactive states during said storage phase, and said second output terminal sustains and provides said evaluated monotonic signals during said 5 storage.

15. The apparatus of claim 14,

wherein said first evaluating means includes two transistor means, two control terminals of said two transistor means receive said first states of two 5 dual-monotonic signals, and wherein said two transistor means are connected to said first control terminal in series; and

wherein said second evaluating means includes two transistor means, two control terminals of said two 10 transistor means receive said second states of two dual-monotonic signals, and wherein said two transistor means are connected to said second control terminal in parallel.

16. The apparatus of claim 11,

wherein said logic outputs generated by said function blocks are provided only in forward fashion, and wherein said completion signals are provided only 5 in backward handshake fashion.

17. The apparatus of claim 16, wherein connections of said function blocks forms a limited path, wherein said apparatus further comprising:

a plurality of control blocks for controlling said 5 detectors and resetting means;

wherein said function blocks and control blocks are designed so that no control blocks are located at said limited path so that raw combinational latencies are determined by said function blocks and will not be affected 10 by control blocks.

18. The apparatus of claim 17, wherein if any said control blocks are located at said limited path, such control block can be removed from said limited path by altering said function blocks and said control blocks.

19. The apparatus of claim 18, wherein each of control block includes detector and resetting means.

20. The apparatus of claim 17, wherein said function blocks are connected in a ring fashion.

21. The apparatus of claim 16, further comprising:  
means for merging at least two logic outputs generated by at least two trailing function blocks to one data path so that said at least two logic outputs  
5 are provided to one preceding function block.

22. The apparatus of claim 16, further comprising:  
means for splitting a logic output generated by a trailing function block to at least two data paths so that said logic output is provided to at least two  
5 preceding function blocks.

23. A method in use with an apparatus, said apparatus including a plurality function blocks with each of said function blocks able to generate a logic output, wherein a logic output generated by a trailing function block  
5 provides a logic input of a preceding function block, said method comprising the steps of:  
detecting completion of generation of logic output for each of said block function;  
generating a completion signal upon detecting that  
10 a preceding function block has generated a logic output in response to a logic output of a trailing function block; and  
resetting said trailing function block in response  
said completion signal.

24. The method of claim 23, further comprising the step of:

generating a final logic output in response to logic outputs from said function blocks.

25. The method of claim 24, further comprising the step of:

generating a final completion signal upon detection of said final logic output.

26. The method of claim 25, wherein connections of said function blocks form a limited path, wherein said apparatus further including a plurality of control blocks for controlling said detectors and resetting means,

5 wherein said method further comprising the step of:

arranging control blocks outside of said limited path so that raw combinational latencies are determined by said function blocks and will not be affected by said control blocks.

27. The method of claim 26, in case that any said control blocks are located at said limited part, further comprising the step of:

5 granizing said function blocks and said control blocks so that such control blocks can be removed from said limited path.

28. An apparatus for generating a logic output, comprising:

a plurality of paths, each of said paths generates a intermediate in response to plurality of logic inputs;

5 wherein one of said paths has high probability of usage; and

wherein said devices in said path have wider conduct path to increase speed of said path.

29. An method for designing an apparatus for generating a logic output, said apparatus includes a plurality of paths, and each of said paths generates a intermediate in response to plurality of logic inputs, said method 5 comprising the steps of:

determining which of the paths has high probability of usage; and

creating wider conducting path for devices in said path to increase speed of said path.

30. An apparatus for generating a pre-determined logic output, comprising:

a plurality of paths, each of said paths generating an intermediate signal in response to a plurality of 5 logic inputs;

wherein one of said paths has high probability of usage, and other of said paths have a lower probability of usage;

wherein said devices in said path are simplified 10 to shorten logic blocks to increase speed of said path; and

wherein said devices in said some other paths are modified to generate equivalence of said pre-determined logic output.

31. A method for designing an apparatus for generating a pre-determined logic output, said apparatus includes a plurality of paths, each of said paths generates an intermediate response to a plurality of logic inputs,

5 said method comprising:

determining a path that has high probability of usage;

determining other of said paths that have lower probability of usage;

10 simplifying devices in said path to shorten logic blocks; and

modifying devices in said some other paths to generate equivalence of said pre-determined logic output.

32. An apparatus for generating a division result, said apparatus including a plurality of function blocks, each of said function blocks generating a remainder and a quotient digit, said remainder and quotient digit 5 generated by a trailing function block comprising inputs for a preceding function block, each of said function blocks comprising:

means for generating a remainder in response to a remainder and a quotient digit generated by said trailing 10 function block; and

means for generating a quotient digit in response to said remainder and said quotient digit generated by said trailing function block;

wherein said remainder and said quotient digits 15 generated by said trailing function block are provided to said remainder generating means and said quotient generating means in parallel.

33. The apparatus of claim 32, wherein said remainder generating means further comprising:

means for providing an original state of said quotient digit generated by said trailing function block, 5 an original state of said quotient digit generated by said trailing function block, or a zero state; and

means for adding a remainder generated by said trailing function block and an output from said providing means;

10 wherein said adding means is capable of saving whole bit carries.

34. The apparatus of claim 33, wherein said quotient digit generating means further comprises:

5 first adding means for adding a remainder generated by said trailing function block and said original state of said quotient digit generated by said trailing function block, wherein said first adding means is capable of saving partial bit carries;

10 second adding means for adding a remainder generated by said trailing function block and said opposite state of said quotient digit generated by said trailing function block, wherein said first adding means is capable of saving partial bit carries; and

means for selecting outputs from said first adding means, said second adding means and said zero state.

35. The apparatus of claim 32 or 34, each function block operates in a resetting phase, an operation phase and a storage phase, said apparatus further comprising:

5 a plurality of detectors, with each of said detectors associated with one of said function blocks, for generating a completion signal upon detecting that a preceding function block has generated a remainder and a quotient digit in response to a remainder and a quotient digit generated by a trailing function block;

10 and

means for resetting said trailing function block in response said completion signal.

36. The apparatus of claim 35,

wherein said remainders and quotient digits generated by said function blocks are provided only in forward fashion, and wherein said completion signals are 5 provided only in backward handshake fashion.

37. The apparatus of claim 36, wherein connections of said function blocks form a limited path, wherein said apparatus further comprising:

a plurality of control blocks for controlling said detectors and resetting means;

5 wherein said function blocks and control blocks are designed so that no control blocks are located at said limited path so that raw combinational latencies are determined by said function blocks and will not be affected by said control blocks.

38. The apparatus of claim 37, wherein if any said control blocks are located at said limited part, such control block can be removed out of said limited path by granizing said function blocks and said control 5 blocks.

39. A method used with an apparatus for generating a division result, said apparatus includes a plurality of function blocks, each of said function blocks generating a remainder and a quotient digit, said remainder and 5 quotient digit generated by a trailing function block serve input for a preceding function block, each of said function block comprising the steps of:

generating a remainder in response to a remainder and a quotient digit generated by a trailing function 10 block; and

generating a quotient digit in response to said remainder and said quotient digit generated by said trailing function block;

15 wherein said remainder generating step and quotient digit generating step receives said remainder and quotient digit generated by said trailing function block in parallel.

40. The method of claim 39, wherein said remainder generating step further comprises the steps of:

providing original state of said quotient digit generated by said trailing function block, original state

of said quotient digit generated by said trailing function block, or zero state; and

adding a remainder generated by said trailing function block and an output from said providing means;

5 wherein said adding step is capable of saving whole bit carries.

40. The method of claim 33, wherein said quotient digit generating means step comprises the steps of:

a first adding step for adding a remainder generated by said trailing function block and said original state 5 of said quotient digit generated by said trailing function block, wherein said first adding means is capable of saving partial bit carries;

a second adding step for adding a remainder generated by said trailing function block and said 10 opposite state of said quotient digit generated by said trailing function block, wherein said first adding means is capable of saving partial bit carries; and

selecting outputs from said first adding step, second adding step and zero state.

41. The method of claim 38 or 40, each function block operating under a resetting phase, an operation phase and a storage phase, said apparatus including a plurality of detectors, with each of said detectors associated with 5 one of said function blocks, said method further comprising the steps of:

generating a completion signal upon detecting that a preceding function block has generated a remainder and a quotient digit in response to a remainder and a 10 quotient digit generated by a trailing function block; and

resetting said trailing function block in response said completion signal.



**Figure 1**

2/9



Figure 2

3/9



Figure 3



Figure 4



Figure 5

6/9



Figure 6

7/9



Figure 7



Figure 8

9/9



Figure 9

## INTERNATIONAL SEARCH REPORT

International Application No. PCT/US91/07256

## I. CLASSIFICATION OF SUBJECT MATTER (If several classification symbols apply, indicate all) \*

According to International Patent Classification (IPC) or to both National Classification and IPC

INT. CL. 5 G11C 8/00

U.S. CL. 307/452,443,272.2,279 365/203, 189.04,222

## II. FIELDS SEARCHED

| Classification System                                                                                                              | Minimum Documentation Searched 7                    |  |
|------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------|--|
|                                                                                                                                    | Classification Symbols                              |  |
| U.S.                                                                                                                               | 307/443,445,452,272.2,279,480<br>365/203,189.04,222 |  |
| Documentation Searched other than Minimum Documentation<br>to the Extent that such Documents are Included in the Fields Searched * |                                                     |  |

## III. DOCUMENTS CONSIDERED TO BE RELEVANT \*

| Category * | Citation of Document, 11 with indication, where appropriate, of the relevant passages 12 | Relevant to Claim No. 13 |
|------------|------------------------------------------------------------------------------------------|--------------------------|
| A          | U S , A, 4,856,029 GEYER ET.AL., 08 AUGUST 1989                                          |                          |
| A          | U S , A, 4,758,990, UCHIDA, 19 JULY 1988                                                 |                          |
| A          | U S , A, 4,710,650 SHOJI, 1 DECEMBER 1987                                                |                          |
| A,P        | U S , A, 4,999,528 KEECH, 12 MARCH 1991                                                  |                          |
| A          | U S , A, 4,747,082 MINATO ET. AL., 24 MAY 1988                                           |                          |
| A          | U S , A, 4,631,701 KAPPELER ET. AL. 23 DECEMBER 1986                                     |                          |

\* Special categories of cited documents: 10

"A" document defining the general state of the art which is not considered to be of particular relevance

"E" earlier document but published on or after the international filing date

"L" document which may throw doubts on priority claim(s) or which is cited to establish the publication date of another citation or for other special reason (as specified)

"O" document referring to an oral disclosure, use, exhibition or other means

"P" document published prior to the international filing date but later than the priority date claimed

"T" later document published after the international filing date or priority date and not in conflict with the application but cited to understand the principle or theory underlying the invention

"X" document of particular relevance; the claimed invention cannot be considered novel or cannot be considered to involve an inventive step

"Y" document of particular relevance; the claimed invention cannot be considered to involve an inventive step when the document is combined with one or more other such documents, such combination being obvious to a person skilled in the art

"Z" document member of the same patent family

## IV. CERTIFICATION

Date of the Actual Completion of the International Search

19 December 1991

Date of Mailing of this International Search Report

24 JAN 1992

International Searching Authority

ISA/US

Signature of Authorized Officer: NGOC-NO

INTERNATIONAL DIVISION

In S. D. MILLER Nguyen



Figure 1



Figure 2

3/9



Figure 3



Figure 4



Figure 5

6/9



Figure 6

7/9



Figure 7



Figure 8

9/9



Figure 9

**This Page is Inserted by IFW Indexing and Scanning  
Operations and is not part of the Official Record**

### **BEST AVAILABLE IMAGES**

Defective images within this document are accurate representations of the original documents submitted by the applicant.

Defects in the images include but are not limited to the items checked:

- BLACK BORDERS**
- IMAGE CUT OFF AT TOP, BOTTOM OR SIDES**
- FADED TEXT OR DRAWING**
- BLURRED OR ILLEGIBLE TEXT OR DRAWING**
- SKEWED/SLANTED IMAGES**
- COLOR OR BLACK AND WHITE PHOTOGRAPHS**
- GRAY SCALE DOCUMENTS**
- LINES OR MARKS ON ORIGINAL DOCUMENT**
- REFERENCE(S) OR EXHIBIT(S) SUBMITTED ARE POOR QUALITY**
- OTHER:** \_\_\_\_\_

**IMAGES ARE BEST AVAILABLE COPY.**

**As rescanning these documents will not correct the image problems checked, please do not report these problems to the IFW Image Problem Mailbox.**